Add CausalMask support with new flash attention api #604

ClarkChin08 · 2025-11-03T09:22:43Z

No description provided.

applications/flash_attention_v2/collective/xe_fmha_fwd_mainloop.hpp

Copilot

Pull Request Overview

This PR adds support for causal masking in the flash attention implementation by introducing a new SubgroupLayoutQK template parameter and implementing the causal mask logic in the mainloop.

Key Changes:

- Added `SubgroupLayoutQK` template parameter to the collective mainloop and kernel interfaces
- Implemented causal masking logic that applies `-INFINITY` to attention scores beyond the causal boundary
- Updated the example runner to conditionally instantiate causal or non-causal configurations based on user options
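To make the masking rule in the list above concrete, here is a minimal host-side sketch in plain C++. It is not the kernel code from this PR. It assumes a row-major score matrix and a causal boundary aligned to the bottom-right corner; the helper name `apply_causal_mask` and that alignment convention are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: masks a row-major [seq_len_qo x seq_len_kv] score matrix in place.
// Assumption: the causal boundary is bottom-right aligned, so query row r may
// attend to key columns c with c <= r + (seq_len_kv - seq_len_qo).
void apply_causal_mask(std::vector<float>& scores, int seq_len_qo, int seq_len_kv) {
  assert(scores.size() == static_cast<std::size_t>(seq_len_qo) * seq_len_kv);
  const int shift = seq_len_kv - seq_len_qo;
  for (int row = 0; row < seq_len_qo; ++row) {
    for (int col = 0; col < seq_len_kv; ++col) {
      if (col > row + shift) {
        // Beyond the causal boundary: exp(-inf) is 0 in the softmax that follows.
        scores[static_cast<std::size_t>(row) * seq_len_kv + col] = -INFINITY;
      }
    }
  }
}
```

With equal query and key lengths this leaves the lower triangle (including the diagonal) untouched and sets everything above the diagonal to negative infinity.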

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `applications/flash_attention_v2/collective/xe_fmha_fwd_mainloop.hpp` | Implements causal mask logic and removes the static assertion that previously blocked causal mask usage |
| `applications/flash_attention_v2/kernel/xe_fhma_fwd_kernel.hpp` | Adds subgroup layout type alias and computes sequence coordinates for causal masking |
| `examples/06_bmg_flash_attention/xe_fmha_fwd_runner.hpp` | Adds `SubgroupLayoutQK` template parameter to mainloop type |
| `examples/06_bmg_flash_attention/06_xe_fmha_fwd.cpp` | Conditionally selects causal or non-causal kernel based on `is_causal` option |
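The last row describes mapping a runtime option onto a compile-time configuration. A minimal sketch of that pattern follows, assuming a boolean template parameter. `FmhaConfig`, `Options`, and `dispatch` are hypothetical stand-ins, and only the `is_causal` flag name comes from the summary above.

```cpp
#include <string>

// Hypothetical stand-in for a kernel configuration; the real runner's types differ.
template <bool Causal>
struct FmhaConfig {
  static std::string run() { return Causal ? "causal" : "non-causal"; }
};

// Hypothetical options struct; only the is_causal flag name is taken from the PR summary.
struct Options {
  bool is_causal = false;
};

// Maps the runtime flag onto a compile-time template argument.
// Each branch instantiates its own specialization.
std::string dispatch(const Options& options) {
  if (options.is_causal) {
    return FmhaConfig<true>::run();
  }
  return FmhaConfig<false>::run();
}
```

Both specializations are compiled, and the flag only picks which one runs, so the masking branch can be removed entirely from the non-causal instantiation.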

Signed-off-by: Chen, Xi2 <[email protected]>

petercad · 2025-11-10T15:47:58Z

applications/flash_attention_v2/collective/xe_fmha_fwd_mainloop.hpp

+            }
+          }
+        }
+      }


@ClarkChin08 Thanks for updating this! By the way, we can make the code even cleaner by including the block offset in cS_thread itself. Something like this should do it:

    Tensor gP = local_tile(cP, TileShapeQK{}, blk_qv);
    auto cS_thread = thr_mma_qk.partition_C(gP);

Then you don't need to do the blocking calculations here; instead row_idx = get<0>(cS_thread(i)), col_idx = get<1>(cS_thread(i)).

Hi @petercad, I changed the code to use local_tile to get the global row and column indices.
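The idea in this thread is that a tile of a coordinate tensor already carries global indices, so the block offset is applied once rather than at every use. The sketch below is a plain C++ analogy of that idea and does not use CuTe. `Coord`, `tile_coords`, and `is_masked` are hypothetical names, and the predicate assumes equal query and key lengths.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Coord = std::pair<int, int>;  // (row, col) in the full score matrix

// Hypothetical analogue of taking one tile of an identity (coordinate) tensor:
// returns the global coordinates of every element of tile (blk_row, blk_col).
std::vector<Coord> tile_coords(int tile_rows, int tile_cols, int blk_row, int blk_col) {
  std::vector<Coord> coords;
  coords.reserve(static_cast<std::size_t>(tile_rows) * tile_cols);
  for (int r = 0; r < tile_rows; ++r) {
    for (int c = 0; c < tile_cols; ++c) {
      // The block offset is folded in once, here.
      coords.emplace_back(blk_row * tile_rows + r, blk_col * tile_cols + c);
    }
  }
  return coords;
}

// The masking predicate reads the global coordinates directly, with no
// per-use block arithmetic (equal Q and KV lengths assumed).
bool is_masked(const Coord& rc) { return rc.second > rc.first; }
```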

petercad · 2025-11-10T16:08:04Z

include/cute/atom/mma_atom.hpp


+  CUTE_HOST_DEVICE constexpr auto
+  get_atom_layout_mnk() const {
+    return atom_layout_mnk_;


Instead of adding a new atom_layout_mnk_ member:

Suggested change

-    return atom_layout_mnk_;
+    return AtomLayoutMNK{};
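The suggestion works when the layout is encoded entirely in its type, so a default-constructed value is always correct and no member needs to be stored. A minimal sketch of that property follows. `StaticLayout` and `TiledAtom` are hypothetical stand-ins, not CuTe's actual definitions.

```cpp
#include <type_traits>

// Hypothetical empty tag type that encodes a layout purely in its template arguments.
template <int M, int N, int K>
struct StaticLayout {
  static constexpr int m = M;
  static constexpr int n = N;
  static constexpr int k = K;
};

template <class AtomLayoutMNK>
struct TiledAtom {
  // No data member is needed: the layout is fully described by the type,
  // so a default-constructed value is always the right one to return.
  constexpr AtomLayoutMNK get_atom_layout_mnk() const { return AtomLayoutMNK{}; }
};

// The wrapper stays empty, which a stored member would not guarantee.
static_assert(std::is_empty<TiledAtom<StaticLayout<8, 4, 1>>>::value,
              "TiledAtom carries no state");
```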

petercad · 2025-11-10T16:34:38Z

applications/flash_attention_v2/kernel/xe_fhma_fwd_kernel.hpp

+      auto discard_seq_coord = s.seq_len_qo - offset;
+      auto full_tile_offset = s.seq_len_kv - offset;
+
+      int seq_coord = cute::min(s.seq_len_qo, (blk_q * get<0>(TileShapeQK{}) + (sub_group_id / get<1>(shape(SubgroupLayoutQK{}))) * SGTileQ));


The sub_group_id / get<1>(shape(SubgroupLayoutQK{})) part is making a strong assumption about how subgroup tiles are arranged within the workgroup tile (K-major). We need to either add a static_assert for this condition, or (better) use CuTe layout algebra to calculate the subgroup Q offset. For instance:

    auto cS = make_identity_tensor(take<0,2>(TiledMMAKQ{}.tile_size()));
    auto tScS = TiledMMAKQ{}.get_slice(thread_idx).partition_C(cS);
    auto q_offset_wi = get<0>(tScS(0));  /* Q offset for thread */
    auto q_offset_sg = group_broadcast(sycl::ext::oneapi::this_work_item::get_sub_group(), q_offset_wi, 0);  /* Q offset for SG */

Thank you, @petercad. I'm now implementing the algebraic approach you suggested for calculating q_offset_sg.
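The concern raised in this thread can be shown with a small plain C++ analogy that uses neither CuTe nor SYCL. It assumes subgroup ids map onto a grid of subgroup tiles through two strides, and it covers only compact orderings. `SubgroupGrid`, `subgroup_row`, and `subgroup_row_hardcoded` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical description of how subgroup ids map onto a rows x cols grid of
// subgroup tiles: id = row * row_stride + col * col_stride.
struct SubgroupGrid {
  int rows, cols;
  int row_stride, col_stride;
};

// Layout-aware lookup: recovers the row (Q-direction tile index) from the
// stride, so it works whether rows or columns vary fastest.
int subgroup_row(const SubgroupGrid& g, int sub_group_id) {
  assert(sub_group_id >= 0 && sub_group_id < g.rows * g.cols);
  return (sub_group_id / g.row_stride) % g.rows;
}

// The shortcut questioned in the review: correct only when columns vary
// fastest, that is when row_stride == cols and col_stride == 1.
int subgroup_row_hardcoded(const SubgroupGrid& g, int sub_group_id) {
  return sub_group_id / g.cols;
}
```

For a 4 x 2 grid, both functions agree when columns vary fastest, but with rows varying fastest subgroup 5 sits in row 1 while the hard-coded division reports row 2.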

petercad reviewed Nov 3, 2025

applications/flash_attention_v2/collective/xe_fmha_fwd_mainloop.hpp (outdated, resolved)

petercad reviewed Nov 3, 2025

applications/flash_attention_v2/collective/xe_fmha_fwd_mainloop.hpp (outdated, resolved)

tdeng5 added the release label Nov 4, 2025

tdeng5 requested review from jiyang1011 and rolandschulz November 5, 2025 03:37

Antonyvance requested a review from Copilot November 5, 2025 07:28

Copilot AI reviewed Nov 5, 2025

applications/flash_attention_v2/kernel/xe_fhma_fwd_kernel.hpp (outdated, resolved)

applications/flash_attention_v2/collective/xe_fmha_fwd_mainloop.hpp (outdated, resolved)

ClarkChin08 added 2 commits November 10, 2025 02:05

add CausalMask support with new flash attention api

a1346c5

Signed-off-by: Chen, Xi2 <[email protected]>

refine causal mask in new fa

f8a0514

Signed-off-by: Chen, Xi2 <[email protected]>

ClarkChin08 force-pushed the fa_causal_mask branch 2 times, most recently from bb07ccc to 836f2c4 Compare November 10, 2025 08:14

fix the template args

21a1bce

Signed-off-by: Chen, Xi2 <[email protected]>

ClarkChin08 force-pushed the fa_causal_mask branch from 836f2c4 to 21a1bce Compare November 10, 2025 08:16

petercad reviewed Nov 10, 2025

ClarkChin08 added 2 commits November 11, 2025 08:45

refine the code

76cd076

Signed-off-by: Chen, Xi2 <[email protected]>

fix misc

92bd4cb

Signed-off-by: Chen, Xi2 <[email protected]>
